
"Visit with us" is a toursim company and the Policy Maker of the company wants to enable and establish a viable business model to expand the customer base. One of the ways to expand the customer base is to introduce a new offering of packages.However, it was difficult to identify the potential customers because customers were contacted at random without looking at the available information.
The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being. This time company wants to harness the available data of existing and potential customers to target the right customers.
The objective is to predict which customers are more likely to purchase the newly introduced travel package.
As a Data Scientist at the "Visit with us" travel company, the task is to analyze the customers' data to provide recommendations to the Policy Maker and to build a model that predicts which potential customers are likely to purchase the newly introduced travel package. The model will be used to make predictions before a customer is contacted.
The following analysis explores the data and builds predictive models using ensemble techniques.
Let's start by importing the libraries we need.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn import tree
# read the dataset
url = 'https://raw.githubusercontent.com/Seprishi/EnsembleTechniques/main/Projects/Tourism.csv'
data = pd.read_csv(url)
# copying data to another variable to avoid any changes to the original data
df=data.copy()
View the first 5 rows of the dataset
data.head()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
View the last 5 rows of the dataset
data.tail()
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
| 4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null int64 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null object 4 CityTier 4888 non-null int64 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null object 7 Gender 4888 non-null object 8 NumberOfPersonVisiting 4888 non-null int64 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null object 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null object 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null int64 15 PitchSatisfactionScore 4888 non-null int64 16 OwnCar 4888 non-null int64 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null object 19 MonthlyIncome 4655 non-null float64 dtypes: float64(7), int64(7), object(6) memory usage: 763.9+ KB
data.shape
(4888, 20)
data.isna().sum()
CustomerID 0 ProdTaken 0 Age 226 TypeofContact 25 CityTier 0 DurationOfPitch 251 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 45 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 Designation 0 MonthlyIncome 233 dtype: int64
Data conversion to Categorical Data type
for feature in data.columns: # Loop through all columns in the dataframe
if data[feature].dtype == 'object': # Only apply for columns with categorical strings
        data[feature] = pd.Categorical(data[feature])  # convert strings to the pandas Categorical dtype
data.head(10)
| CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
| 5 | 200005 | 0 | 32.0 | Company Invited | 1 | 8.0 | Salaried | Male | 3 | 3.0 | Basic | 3.0 | Single | 1.0 | 0 | 5 | 1 | 1.0 | Executive | 18068.0 |
| 6 | 200006 | 0 | 59.0 | Self Enquiry | 1 | 9.0 | Small Business | Female | 2 | 2.0 | Basic | 5.0 | Divorced | 5.0 | 1 | 2 | 1 | 1.0 | Executive | 17670.0 |
| 7 | 200007 | 0 | 30.0 | Self Enquiry | 1 | 30.0 | Salaried | Male | 3 | 3.0 | Basic | 3.0 | Married | 2.0 | 0 | 2 | 0 | 1.0 | Executive | 17693.0 |
| 8 | 200008 | 0 | 38.0 | Company Invited | 1 | 29.0 | Salaried | Male | 2 | 4.0 | Standard | 3.0 | Unmarried | 1.0 | 0 | 3 | 0 | 0.0 | Senior Manager | 24526.0 |
| 9 | 200009 | 0 | 36.0 | Self Enquiry | 1 | 33.0 | Small Business | Male | 3 | 3.0 | Deluxe | 3.0 | Divorced | 7.0 | 0 | 3 | 1 | 0.0 | Manager | 20237.0 |
# creating list of category columns that are not object type
cat_cols = ["CityTier","ProdTaken","NumberOfPersonVisiting","Passport","PitchSatisfactionScore","OwnCar"]
data[cat_cols] = data[cat_cols].astype("category")
# selecting all object datatypes and converting to category
cols = data.select_dtypes(["object"])
for i in cols.columns:
data[i] = data[i].astype("category")
# check the dataset for updated datatypes
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4888 entries, 0 to 4887 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 4888 non-null int64 1 ProdTaken 4888 non-null category 2 Age 4662 non-null float64 3 TypeofContact 4863 non-null category 4 CityTier 4888 non-null category 5 DurationOfPitch 4637 non-null float64 6 Occupation 4888 non-null category 7 Gender 4888 non-null category 8 NumberOfPersonVisiting 4888 non-null category 9 NumberOfFollowups 4843 non-null float64 10 ProductPitched 4888 non-null category 11 PreferredPropertyStar 4862 non-null float64 12 MaritalStatus 4888 non-null category 13 NumberOfTrips 4748 non-null float64 14 Passport 4888 non-null category 15 PitchSatisfactionScore 4888 non-null category 16 OwnCar 4888 non-null category 17 NumberOfChildrenVisiting 4822 non-null float64 18 Designation 4888 non-null category 19 MonthlyIncome 4655 non-null float64 dtypes: category(12), float64(7), int64(1) memory usage: 364.9 KB
data['ProdTaken'].value_counts()
0 3968 1 920 Name: ProdTaken, dtype: int64
cat_cols1 = ['Designation','ProdTaken', 'OwnCar', 'Passport',
'CityTier','MaritalStatus', 'PreferredPropertyStar',
'ProductPitched','Gender','Occupation','TypeofContact'
]
for column in cat_cols1:
print('-'*30)
print(data[column].value_counts())
print('-'*30)
------------------------------ Executive 1842 Manager 1732 Senior Manager 742 AVP 342 VP 230 Name: Designation, dtype: int64 ------------------------------ ------------------------------ 0 3968 1 920 Name: ProdTaken, dtype: int64 ------------------------------ ------------------------------ 1 3032 0 1856 Name: OwnCar, dtype: int64 ------------------------------ ------------------------------ 0 3466 1 1422 Name: Passport, dtype: int64 ------------------------------ ------------------------------ 1 3190 3 1500 2 198 Name: CityTier, dtype: int64 ------------------------------ ------------------------------ Married 2340 Divorced 950 Single 916 Unmarried 682 Name: MaritalStatus, dtype: int64 ------------------------------ ------------------------------ 3.0 2993 5.0 956 4.0 913 Name: PreferredPropertyStar, dtype: int64 ------------------------------ ------------------------------ Basic 1842 Deluxe 1732 Standard 742 Super Deluxe 342 King 230 Name: ProductPitched, dtype: int64 ------------------------------ ------------------------------ Male 2916 Female 1817 Fe Male 155 Name: Gender, dtype: int64 ------------------------------ ------------------------------ Salaried 2368 Small Business 2084 Large Business 434 Free Lancer 2 Name: Occupation, dtype: int64 ------------------------------ ------------------------------ Self Enquiry 3444 Company Invited 1419 Name: TypeofContact, dtype: int64 ------------------------------
Observation
Merging of duplicate category entries
data['Gender'] = data['Gender'].apply(lambda x: 'Female' if x == 'Fe Male' else x)
data.Gender.value_counts()
Male 2916 Female 1972 Name: Gender, dtype: int64
Missing Value Treatment
data.isna().sum()
CustomerID 0 ProdTaken 0 Age 226 TypeofContact 25 CityTier 0 DurationOfPitch 251 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 45 ProductPitched 0 PreferredPropertyStar 26 MaritalStatus 0 NumberOfTrips 140 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 66 Designation 0 MonthlyIncome 233 dtype: int64
Observation
# replace the missing values with the median income w.r.t. the customer's designation
data["MonthlyIncome"] = data.groupby(["Designation"])["MonthlyIncome"].transform(lambda x: x.fillna(x.median()))
data["Age"] = data.groupby(["Designation"])["Age"].transform(lambda x: x.fillna(x.median()))
data["NumberOfChildrenVisiting"] = data["NumberOfChildrenVisiting"].transform(lambda x: x.fillna(x.median()))
data["NumberOfTrips"] = data["NumberOfTrips"].transform(lambda x: x.fillna(x.median()))
data["PreferredPropertyStar"] = data["PreferredPropertyStar"].transform(lambda x: x.fillna(x.median()))
data["NumberOfFollowups"] = data["NumberOfFollowups"].transform(lambda x: x.fillna(x.median()))
data["DurationOfPitch"] = data["DurationOfPitch"].transform(lambda x: x.fillna(x.median()))
# treating missing values in the remaining categorical variable with its most frequent value, "Self Enquiry"
data["TypeofContact"] = data["TypeofContact"].fillna("Self Enquiry")
data.isna().sum()
CustomerID 0 ProdTaken 0 Age 0 TypeofContact 0 CityTier 0 DurationOfPitch 0 Occupation 0 Gender 0 NumberOfPersonVisiting 0 NumberOfFollowups 0 ProductPitched 0 PreferredPropertyStar 0 MaritalStatus 0 NumberOfTrips 0 Passport 0 PitchSatisfactionScore 0 OwnCar 0 NumberOfChildrenVisiting 0 Designation 0 MonthlyIncome 0 dtype: int64
Observation
# Summary of continuous columns
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | 202443.500000 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| Age | 4888.0 | 37.429828 | 9.149822 | 18.0 | 31.00 | 36.0 | 43.00 | 61.0 |
| DurationOfPitch | 4888.0 | 15.362930 | 8.316166 | 5.0 | 9.00 | 13.0 | 19.00 | 127.0 |
| NumberOfFollowups | 4888.0 | 3.711129 | 0.998271 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4888.0 | 3.577946 | 0.797005 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4888.0 | 3.229746 | 1.822769 | 1.0 | 2.00 | 3.0 | 4.00 | 22.0 |
| NumberOfChildrenVisiting | 4888.0 | 1.184738 | 0.852323 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4888.0 | 23546.843903 | 5266.279293 | 1000.0 | 20485.00 | 22413.5 | 25424.75 | 98678.0 |
# summary of categorical columns
data.describe(include="category").T
| count | unique | top | freq | |
|---|---|---|---|---|
| ProdTaken | 4888 | 2 | 0 | 3968 |
| CityTier | 4888 | 3 | 1 | 3190 |
| Occupation | 4888 | 4 | Salaried | 2368 |
| NumberOfPersonVisiting | 4888 | 5 | 3 | 2402 |
| ProductPitched | 4888 | 5 | Basic | 1842 |
| MaritalStatus | 4888 | 4 | Married | 2340 |
| Passport | 4888 | 2 | 0 | 3466 |
| PitchSatisfactionScore | 4888 | 5 | 3 | 1478 |
| OwnCar | 4888 | 2 | 1 | 3032 |
| Designation | 4888 | 5 | Executive | 1842 |
Observations
More analysis on Age and Monthly Income
data['Agebin'] = pd.cut(data['Age'], bins = [18,25, 31, 40, 50, 65], labels = ['18-25','26-30', '31-40', '41-50', '51-65'])
data.Agebin.value_counts()
31-40 1948 41-50 1073 26-30 971 51-65 549 18-25 333 Name: Agebin, dtype: int64
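Note that the bin counts above sum to 4,874 rather than 4,888: `pd.cut` builds right-closed, left-open intervals by default, so customers aged exactly 18 fall outside the first bin `(18, 25]` and receive a missing `Agebin`. A minimal illustration with toy ages (the values here are hypothetical):

```python
import pandas as pd

ages = pd.Series([18, 19, 25, 26])

# Default: intervals are left-open/right-closed, so 18 is outside (18, 25]
default_bins = pd.cut(ages, bins=[18, 25, 31], labels=['18-25', '26-30'])

# include_lowest=True closes the first interval on the left: [18, 25]
fixed_bins = pd.cut(ages, bins=[18, 25, 31], labels=['18-25', '26-30'],
                    include_lowest=True)

print(default_bins.isna().tolist())  # [True, False, False, False]
print(fixed_bins.isna().tolist())    # [False, False, False, False]
```

Passing `include_lowest=True` (or starting the bin edges slightly below the minimum age) would keep those rows in the first bin.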
data.MonthlyIncome.describe()
count 4888.000000 mean 23546.843903 std 5266.279293 min 1000.000000 25% 20485.000000 50% 22413.500000 75% 25424.750000 max 98678.000000 Name: MonthlyIncome, dtype: float64
data['Incomebin'] = pd.cut(data['MonthlyIncome'], bins = [0,15000,20000, 25000, 30000,35000,40000,45000,50000,100000], labels = ['<15000', '<20000', '<25000', '<30000','<35000','<40000','<45000','<50000','<100000'])
data.Incomebin.value_counts()
<25000 2490 <20000 1038 <30000 768 <35000 382 <40000 206 <15000 2 <100000 2 <45000 0 <50000 0 Name: Incomebin, dtype: int64
Observation
Dropping of unwanted columns
# Dropping the CustomerID column, which is an identifier with no predictive value
data.drop(columns=['CustomerID'], inplace=True)
Univariate Analysis
def histogram_boxplot(data, feature, figsize=(7, 4), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
    figsize: size of figure (default (7, 4))
    kde: whether to show the density curve (default True)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # histogram (sns.distplot is deprecated and does not accept the data=/x= keywords)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 0.5, 4))
else:
plt.figure(figsize=(n + 0.5, 4))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 3))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
histogram_boxplot(data,'Age')
histogram_boxplot(data,'DurationOfPitch')
histogram_boxplot(data,'NumberOfFollowups')
histogram_boxplot(data,'PreferredPropertyStar')
histogram_boxplot(data,'NumberOfTrips')
histogram_boxplot(data,'NumberOfChildrenVisiting')
histogram_boxplot(data,'MonthlyIncome')
Observations
labeled_barplot(data,'Agebin',perc=True)
labeled_barplot(data,'Incomebin',perc=True)
labeled_barplot(data,'Designation',perc=True)
labeled_barplot(data,'CityTier',perc=True)
labeled_barplot(data,'Occupation',perc=True)
labeled_barplot(data,'ProductPitched',perc=True)
labeled_barplot(data,'PreferredPropertyStar',perc=True)
labeled_barplot(data,'OwnCar',perc=True)
labeled_barplot(data,'Passport',perc=True)
labeled_barplot(data,'TypeofContact',perc=True)
labeled_barplot(data,'MaritalStatus',perc=True)
sns.catplot(x='TypeofContact', data=data, kind='count', height=3)  # catplot creates its own figure
plt.title("TypeofContact")
plt.show()
sns.catplot(x='Gender', data=data, kind='count', height=3)
plt.title("Gender")
plt.show()
sns.catplot(x='CityTier', data=data, kind='count', height=3)
plt.title("CityTier")
plt.show()
sns.catplot(x='Agebin', data=data, kind='count', height=3)
plt.title("Agebin")
plt.show()
data['ProductPitched'].value_counts().plot.pie(autopct='%1.1f%%', figsize=(8, 8))
plt.title("ProductPitched")
plt.show()
Observations
sns.lineplot(x = 'Age', y = 'MonthlyIncome', data = data, color = 'purple',ci = None).set(title='Lineplot for Age & Income')
[Text(0.5, 1.0, 'Lineplot for Age & Income')]
sns.lineplot(x = 'Age', y = 'DurationOfPitch', data = data, color = 'purple',ci = None).set(title='Lineplot for Age & PitchDuration')
[Text(0.5, 1.0, 'Lineplot for Age & PitchDuration')]
sns.lineplot(x = 'Age', y = 'PreferredPropertyStar', data = data, color = 'purple',ci = None).set(title='Lineplot for Age & PropertyStar')
[Text(0.5, 1.0, 'Lineplot for Age & PropertyStar')]
sns.lineplot(data=data, x='Age', y='MonthlyIncome', ci=False, hue='ProdTaken').set(title='Lineplot for Age and MonthlyIncome against ProdTaken');
Observation
# use the defined function stacked_barplot to plot the graphs
stacked_barplot(df, "TypeofContact", "ProdTaken")
ProdTaken 0 1 All TypeofContact All 3946 917 4863 Self Enquiry 2837 607 3444 Company Invited 1109 310 1419 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "CityTier", "ProdTaken")
ProdTaken 0 1 All CityTier All 3968 920 4888 1 2670 520 3190 3 1146 354 1500 2 152 46 198 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Occupation", "ProdTaken")
ProdTaken 0 1 All Occupation All 3968 920 4888 Salaried 1954 414 2368 Small Business 1700 384 2084 Large Business 314 120 434 Free Lancer 0 2 2 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Gender", "ProdTaken")
ProdTaken 0 1 All Gender All 3968 920 4888 Male 2338 578 2916 Female 1500 317 1817 Fe Male 130 25 155 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "NumberOfPersonVisiting", "ProdTaken")
ProdTaken 0 1 All NumberOfPersonVisiting All 3968 920 4888 3 1942 460 2402 2 1151 267 1418 4 833 193 1026 1 39 0 39 5 3 0 3 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "ProductPitched", "ProdTaken")
ProdTaken 0 1 All ProductPitched All 3968 920 4888 Basic 1290 552 1842 Deluxe 1528 204 1732 Standard 618 124 742 King 210 20 230 Super Deluxe 322 20 342 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "PreferredPropertyStar", "ProdTaken")
ProdTaken 0 1 All PreferredPropertyStar All 3948 914 4862 3.0 2511 482 2993 5.0 706 250 956 4.0 731 182 913 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "MaritalStatus", "ProdTaken")
ProdTaken 0 1 All MaritalStatus All 3968 920 4888 Married 2014 326 2340 Single 612 304 916 Unmarried 516 166 682 Divorced 826 124 950 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Passport", "ProdTaken")
ProdTaken 0 1 All Passport All 3968 920 4888 1 928 494 1422 0 3040 426 3466 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(df, "Designation", "ProdTaken")
ProdTaken 0 1 All Designation All 3968 920 4888 Executive 1290 552 1842 Manager 1528 204 1732 Senior Manager 618 124 742 AVP 322 20 342 VP 210 20 230 ------------------------------------------------------------------------------------------------------------------------
Observations
sns.pairplot(data=data, hue="ProdTaken")
plt.show()
cust_prof=data[data['ProdTaken']==1]
sns.countplot(x='Agebin',hue='ProductPitched',data=cust_prof).set_title('Agebin Product wise')
sns.despine(top=True,right=True,left=True)
sns.countplot(x='Incomebin',hue='ProductPitched',data=cust_prof).set_title('Incomebin Product Pitched')
sns.despine(top=True,right=True,left=True)
sns.countplot(x="ProductPitched", data=cust_prof)
sns.despine(top=True,right=True,left=True)
sns.barplot(y='Age',x='ProductPitched',data=cust_prof).set_title('Age vs Product Pitched')
sns.despine(top=True,right=True,left=True)
sns.countplot(x="ProductPitched", data=cust_prof, hue="Occupation")
sns.despine(top=True,right=True,left=True)
sns.countplot(x="ProductPitched", data=cust_prof, hue="Gender")
sns.despine(top=True,right=True,left=True)
sns.countplot(x="ProductPitched", data=cust_prof, hue="MaritalStatus")
sns.despine(top=True,right=True,left=True)
sns.barplot(y='MonthlyIncome',x='ProductPitched',data=cust_prof).set_title('Monthly Income vs Product Pitched')
sns.despine(top=True,right=True,left=True)
sns.barplot(x='Designation',y='MonthlyIncome',data=cust_prof,hue='ProductPitched').set_title('Designation vs Income')
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x7f223614f040>
sns.countplot(x="ProductPitched", data=cust_prof, hue="PreferredPropertyStar")
sns.despine(top=True,right=True,left=True)
sns.countplot(x="ProductPitched", data=cust_prof, hue="OwnCar")
sns.despine(top=True,right=True,left=True)
sns.countplot(x="ProductPitched", data=cust_prof, hue="CityTier")
sns.despine(top=True,right=True,left=True)
Correlation Check
sns.set(rc={'figure.figsize':(12,7)})
sns.heatmap(data.corr(),
annot=True,
linewidths=.5,
center=0,
cbar=False,
cmap="Spectral")
plt.show()
Observation
Outliers Treatment
Outliertreateddata = data.copy()
Q3 = Outliertreateddata['MonthlyIncome'].quantile(0.75)
Q1 = Outliertreateddata['MonthlyIncome'].quantile(0.25)
IQR = Q3-Q1
Outliertreateddata = Outliertreateddata[(Outliertreateddata['MonthlyIncome'] > Q1 - 1.5*IQR) & (Outliertreateddata['MonthlyIncome'] < Q3 + 1.5*IQR)]
Q3 = Outliertreateddata['NumberOfTrips'].quantile(0.75)
Q1 = Outliertreateddata['NumberOfTrips'].quantile(0.25)
IQR = Q3-Q1
Outliertreateddata = Outliertreateddata[(Outliertreateddata['NumberOfTrips'] > Q1 - 1.5*IQR) & (Outliertreateddata['NumberOfTrips'] < Q3 + 1.5*IQR)]
num_data = data.select_dtypes(include=['float64','int64'])  # restrict to numeric columns
Q1 = num_data.quantile(0.25) # To find the 25th percentile and 75th percentile.
Q3 = num_data.quantile(0.75)
IQR = Q3 - Q1 #Inter Quantile Range (75th perentile - 25th percentile)
lower=Q1-1.5*IQR #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper=Q3+1.5*IQR
((data.select_dtypes(include=['float64','int64'])<lower) | (data.select_dtypes(include=['float64','int64'])>upper)).sum()/len(data)*100
Age 0.000000 DurationOfPitch 2.291326 NumberOfFollowups 6.382979 PreferredPropertyStar 0.000000 NumberOfTrips 2.229951 NumberOfChildrenVisiting 0.000000 MonthlyIncome 7.671849 dtype: float64
# Check MonthlyIncome extreme values
data.sort_values(by=["MonthlyIncome"],ascending = False).head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | ... | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | Agebin | Incomebin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2482 | 0 | 37.0 | Self Enquiry | 1 | 12.0 | Salaried | Female | 3 | 5.0 | Basic | ... | Divorced | 2.0 | 1 | 2 | 1 | 1.0 | Executive | 98678.0 | 31-40 | <100000 |
| 38 | 0 | 36.0 | Self Enquiry | 1 | 11.0 | Salaried | Female | 2 | 4.0 | Basic | ... | Divorced | 1.0 | 1 | 2 | 1 | 0.0 | Executive | 95000.0 | 31-40 | <100000 |
| 2634 | 0 | 53.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 5.0 | King | ... | Divorced | 2.0 | 0 | 2 | 1 | 2.0 | VP | 38677.0 | 51-65 | <40000 |
| 4104 | 0 | 53.0 | Self Enquiry | 1 | 7.0 | Salaried | Male | 4 | 5.0 | King | ... | Married | 2.0 | 0 | 1 | 1 | 3.0 | VP | 38677.0 | 51-65 | <40000 |
| 3190 | 0 | 42.0 | Company Invited | 1 | 14.0 | Salaried | Female | 3 | 6.0 | King | ... | Married | 3.0 | 0 | 4 | 1 | 1.0 | VP | 38651.0 | 41-50 | <40000 |
5 rows × 21 columns
# Check NumberOfTrips extreme values
data.sort_values(by=["NumberOfTrips"],ascending = False).head(5)
| ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | ... | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome | Agebin | Incomebin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3260 | 0 | 40.0 | Company Invited | 1 | 16.0 | Salaried | Male | 4 | 4.0 | Deluxe | ... | Unmarried | 22.0 | 0 | 2 | 1 | 1.0 | Manager | 25460.0 | 31-40 | <30000 |
| 816 | 0 | 39.0 | Company Invited | 1 | 15.0 | Salaried | Male | 3 | 3.0 | Deluxe | ... | Unmarried | 21.0 | 0 | 2 | 1 | 0.0 | Manager | 21782.0 | 31-40 | <25000 |
| 2829 | 1 | 31.0 | Company Invited | 1 | 11.0 | Large Business | Male | 3 | 4.0 | Basic | ... | Single | 20.0 | 1 | 4 | 1 | 2.0 | Executive | 20963.0 | 26-30 | <25000 |
| 385 | 1 | 30.0 | Company Invited | 1 | 10.0 | Large Business | Male | 2 | 3.0 | Basic | ... | Single | 19.0 | 1 | 4 | 1 | 1.0 | Executive | 17285.0 | 26-30 | <20000 |
| 3155 | 1 | 30.0 | Self Enquiry | 1 | 17.0 | Salaried | Female | 4 | 5.0 | Basic | ... | Single | 8.0 | 1 | 5 | 1 | 2.0 | Executive | 21082.0 | 26-30 | <25000 |
5 rows × 21 columns
Removing these outliers from duration of pitch, monthly income, and number of trips
data.drop(index=data[data.DurationOfPitch>37].index,inplace=True)
#There are just 4 such observations with monthly income less than 12000 or greater than 40000
data.drop(index=data[(data.MonthlyIncome>40000) | (data.MonthlyIncome<12000)].index,inplace=True)
# There are just 4 such observations with number of trips greater than 10.
data.drop(index=data[data.NumberOfTrips>10].index,inplace=True)
Basic package
Deluxe package
King package
Super Deluxe package
Standard package
As part of the EDA, missing value treatment, outlier detection and treatment, and feature engineering are now complete. We will next focus on data preparation for model building and on the evaluation criterion.
The goal for the organization is to predict which customers are likely to purchase the newly introduced travel package, so that the right customers can be targeted before they are contacted.
For the above objectives, it is important that both false positives and false negatives are low. Hence, we want to maximize the F1-score: the higher the F1-score, the greater the chance of predicting both classes correctly.
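As a quick worked example (with hypothetical labels), the F1-score is the harmonic mean of precision and recall, so it is high only when both error types are kept low:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions, for illustration only
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP=3, FP=1 -> 3/4
r = recall_score(y_true, y_pred)     # TP=3, FN=1 -> 3/4
f1 = f1_score(y_true, y_pred)        # 2*p*r / (p + r) = 0.75
print(p, r, f1)
```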
We will build the following models, then evaluate, tune, and compare them using these metrics to derive the outcome:
# Separating features and the target column; the pitch-related columns are dropped
# because they are only known after a customer has been contacted
X = data.drop(['ProdTaken','PitchSatisfactionScore','ProductPitched','NumberOfFollowups','DurationOfPitch','Agebin','Incomebin'],axis=1)
X = pd.get_dummies(X,drop_first=True)
y = data['ProdTaken']
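`pd.get_dummies(..., drop_first=True)` one-hot encodes each categorical column and drops its first level, since that level is implied when all remaining dummies are zero. A toy example (the column values here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                    'CityTier': pd.Categorical([1, 3, 1])})

# Female and CityTier 1 become the implicit baseline levels
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())  # ['Gender_Male', 'CityTier_3']
```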
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
X_train.shape, X_test.shape
((3414, 25), (1464, 25))
y.value_counts(1)
0 0.811808 1 0.188192 Name: ProdTaken, dtype: float64
y_test.value_counts(1)
0 0.811475 1 0.188525 Name: ProdTaken, dtype: float64
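Before defining the helper functions, the overall model-building loop can be sketched. This is a minimal sketch on synthetic data with a similar ~81/19 class imbalance, using default hyperparameters; the real notebook fits tuned models on the `X_train`/`X_test` split above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for the tourism features (illustrative only)
Xs, ys = make_classification(n_samples=600, n_features=10,
                             weights=[0.81, 0.19], random_state=1)
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    Xs, ys, test_size=0.30, random_state=1, stratify=ys)

models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}
# Fit each model and compare test-set F1-scores
f1_scores = {name: f1_score(ys_te, m.fit(Xs_tr, ys_tr).predict(Xs_te))
             for name, m in models.items()}
print(f1_scores)
```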
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    '''
    model : classifier used to predict values of X_test
    y_actual : ground truth
    labels : class labels in the order used for the confusion matrix
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
df_cm = pd.DataFrame(cm, index = [i for i in ["Actual - No","Actual - Yes"]],
columns = [i for i in ['Predicted - No','Predicted - Yes']])
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in
zip(group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=labels,fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
## Function to calculate different metric scores of the model - accuracy, recall, precision, and F1
def get_metrics_score(model, flag=True):
    '''
    model : classifier to predict values of X
    flag  : if True, print the scores; the list of scores is returned either way
    '''
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    f1_score_train = metrics.f1_score(y_train, pred_train)
    f1_score_test = metrics.f1_score(y_test, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall,
                       train_precision, test_precision, f1_score_train, f1_score_test))
    # The print statements are displayed only when the flag is True (the default).
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 Score on training set : ", f1_score_train)
        print("F1 Score on test set : ", f1_score_test)
    return score_list  # returning the list with train and test scores
dtree=DecisionTreeClassifier(random_state=1)
dtree.fit(X_train,y_train)
DecisionTreeClassifier(random_state=1)
dtree_model_perf=get_metrics_score(dtree)
Accuracy on training set : 1.0
Accuracy on test set : 0.8586065573770492
Recall on training set : 1.0
Recall on test set : 0.6376811594202898
Precision on training set : 1.0
Precision on test set : 0.6219081272084805
F1 Score on training set : 1.0
F1 Score on test set : 0.629695885509839
make_confusion_matrix(dtree,y_test)
feature_names = X_train.columns
# plot the decision tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
dtree,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=['0','1'],
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Choose the type of classifier.
dtree_tuned = DecisionTreeClassifier(random_state=1,class_weight = {0:.15,1:.85})
# Grid of parameters to choose from
parameters = {'max_depth': list(np.arange(10,60,10)) + [None],
"criterion": ["gini","entropy"],
'min_samples_leaf': [1, 3, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10, 15] + [None],
'min_impurity_decrease': [0.001, 0.01, 0.1, 0.0]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=20,
                       random_state=1)
dtree_tuned_model_perf=get_metrics_score(dtree_tuned)
Accuracy on training set : 0.997949619214997
Accuracy on test set : 0.8620218579234973
Recall on training set : 0.9984423676012462
Recall on test set : 0.6231884057971014
Precision on training set : 0.990726429675425
Precision on test set : 0.6370370370370371
F1 Score on training set : 0.9945694336695112
F1 Score on test set : 0.6300366300366301
make_confusion_matrix(dtree_tuned,y_test)
# plot the decision tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
dtree_tuned,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
Plotting the feature importance of each variable
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(pd.DataFrame(dtree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                 Imp
MonthlyIncome               0.196359
Age                         0.195289
Passport_1                  0.086588
NumberOfTrips               0.084495
Designation_Executive       0.061933
PreferredPropertyStar       0.057549
CityTier_3                  0.046143
TypeofContact_Self Enquiry  0.037403
MaritalStatus_Single        0.030310
Occupation_Salaried         0.024263
Occupation_Small Business   0.021685
Gender_Male                 0.021242
NumberOfChildrenVisiting    0.020321
MaritalStatus_Married       0.019223
OwnCar_1                    0.019214
Designation_Senior Manager  0.016882
CityTier_2                  0.014614
MaritalStatus_Unmarried     0.010150
Occupation_Large Business   0.009872
NumberOfPersonVisiting_2    0.007400
NumberOfPersonVisiting_4    0.006617
Designation_Manager         0.006528
NumberOfPersonVisiting_3    0.004073
Designation_VP              0.001847
NumberOfPersonVisiting_5    0.000000
feature_names = X_train.columns
importances = dtree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
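As a sanity check on Gini importance, the values reported by `feature_importances_` are normalized: they sum to 1 across all features. A toy fit on synthetic data (not this dataset) confirms this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=1)
toy_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_toy, y_toy)

# Gini importance = normalized total impurity decrease contributed per feature
imp = toy_tree.feature_importances_
print(imp.sum())  # → 1.0 (up to floating point)
```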
rf_estimator=RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
rf_estimator_model_perf=get_metrics_score(rf_estimator)
Accuracy on training set : 1.0
Accuracy on test set : 0.889344262295082
Recall on training set : 1.0
Recall on test set : 0.4782608695652174
Precision on training set : 1.0
Precision on test set : 0.88
F1 Score on training set : 1.0
F1 Score on test set : 0.619718309859155
make_confusion_matrix(rf_estimator,y_test)
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(random_state=1,class_weight={0:0.15,1:0.85})
# Grid of parameters to choose from
parameters = {
'max_depth':[4, 6, 8, 10, None],
'max_features': ['sqrt','log2',None],
'n_estimators': np.arange(110,251,50),
'min_samples_leaf': np.arange(1,6,1),
"max_samples": np.arange(0.3, 0.7, 0.1),
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.15, 1: 0.85}, max_depth=10,
                       max_features=None, max_samples=0.3, min_samples_leaf=5,
                       n_estimators=110, random_state=1)
# importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; also known as the Gini importance)
print(pd.DataFrame(rf_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                 Imp
MonthlyIncome               0.178760
Age                         0.154781
Passport_1                  0.117664
Designation_Executive       0.082400
NumberOfTrips               0.068509
CityTier_3                  0.059147
PreferredPropertyStar       0.055408
MaritalStatus_Single        0.032860
MaritalStatus_Married       0.031370
NumberOfChildrenVisiting    0.028735
Gender_Male                 0.023175
TypeofContact_Self Enquiry  0.019838
OwnCar_1                    0.018311
Designation_Senior Manager  0.017950
Occupation_Small Business   0.017766
Occupation_Salaried         0.016791
Designation_Manager         0.015342
NumberOfPersonVisiting_4    0.012237
NumberOfPersonVisiting_3    0.011274
NumberOfPersonVisiting_2    0.011098
MaritalStatus_Unmarried     0.010991
Occupation_Large Business   0.010289
CityTier_2                  0.004548
Designation_VP              0.000755
NumberOfPersonVisiting_5    0.000000
feature_names = X_train.columns
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
#Calculating different metrics
rf_estimator_tuned_model_perf=get_metrics_score(rf_tuned)
Accuracy on training set : 0.8749267721148213
Accuracy on test set : 0.8299180327868853
Recall on training set : 0.6962616822429907
Recall on test set : 0.605072463768116
Precision on training set : 0.658321060382916
Precision on test set : 0.5439739413680782
F1 Score on training set : 0.6767600302800908
F1 Score on test set : 0.5728987993138938
make_confusion_matrix(rf_tuned,y_test)
# baggingClassifier
bagging_classifier = BaggingClassifier(random_state=1)
# fit the model on training dataset
bagging_classifier.fit(X_train, y_train)
BaggingClassifier(random_state=1)
# check the scores on Training and Testing Datasets
bgc_score = get_metrics_score(bagging_classifier)
Accuracy on training set : 0.9882835383714118
Accuracy on test set : 0.8859289617486339
Recall on training set : 0.940809968847352
Recall on test set : 0.5144927536231884
Precision on training set : 0.9966996699669967
Precision on test set : 0.8114285714285714
F1 Score on training set : 0.9679487179487178
F1 Score on test set : 0.6297117516629711
make_confusion_matrix(bagging_classifier,y_test)
# BaggingClassifier with gini and class_weight for appropriate importance
bgc_dt = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion="gini",class_weight={0:0.15,1:0.85},random_state=1),random_state=1)
# fit the model on training set
bgc_dt.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85},
                                                        random_state=1),
                  random_state=1)
bgcdt_score = get_metrics_score(bgc_dt)
Accuracy on training set : 0.9888693614528412
Accuracy on test set : 0.8743169398907104
Recall on training set : 0.9439252336448598
Recall on test set : 0.4492753623188406
Precision on training set : 0.9967105263157895
Precision on test set : 0.7948717948717948
F1 Score on training set : 0.9695999999999999
F1 Score on test set : 0.574074074074074
make_confusion_matrix(bgc_dt,y_test)
# set the parameters
parameters = {
"n_estimators":np.arange(10,60,10),
"max_features": [0.7,0.8,0.9],
"max_samples": [0.7,0.8,0.9],
}
# assigning Bootstrap = True to select features with Replacement
bgc1 = BaggingClassifier(random_state=1,bootstrap=True)
# type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# run the grid search
grid_obj = GridSearchCV(bgc1, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# set the clf to the best combination of parameters
bgcht = grid_obj.best_estimator_
# fit the best algorithm to the data.
bgcht.fit(X_train, y_train)
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=50,
                  random_state=1)
bgcht_score = get_metrics_score(bgcht)
Accuracy on training set : 0.9997070884592852
Accuracy on test set : 0.89275956284153
Recall on training set : 0.9984423676012462
Recall on test set : 0.5072463768115942
Precision on training set : 1.0
Precision on test set : 0.8695652173913043
F1 Score on training set : 0.9992205767731879
F1 Score on test set : 0.6407322654462243
make_confusion_matrix(bgcht,y_test)
ab_Classifier=AdaBoostClassifier(random_state=1)
ab_Classifier.fit(X_train,y_train)
AdaBoostClassifier(random_state=1)
ab_classifier_score=get_metrics_score(ab_Classifier)
Accuracy on training set : 0.8427065026362038
Accuracy on test set : 0.8435792349726776
Recall on training set : 0.29439252336448596
Recall on test set : 0.286231884057971
Precision on training set : 0.6923076923076923
Precision on test set : 0.7117117117117117
F1 Score on training set : 0.4131147540983606
F1 Score on test set : 0.4082687338501292
make_confusion_matrix(ab_Classifier,y_test)
# Choose the type of classifier.
ab_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"base_estimator":[DecisionTreeClassifier(max_depth=1,random_state=1),DecisionTreeClassifier(max_depth=2,random_state=1),
DecisionTreeClassifier(max_depth=3,random_state=1)],
'n_estimators': np.arange(10,100,10),
'learning_rate': [1, 0.1, 0.5, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(ab_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
ab_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
ab_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=1, n_estimators=90, random_state=1)
ab_tuned_classifier_score=get_metrics_score(ab_tuned)
Accuracy on training set : 0.9651435266549502
Accuracy on test set : 0.8367486338797814
Recall on training set : 0.8613707165109035
Recall on test set : 0.4384057971014493
Precision on training set : 0.9485420240137221
Precision on test set : 0.5902439024390244
F1 Score on training set : 0.9028571428571429
F1 Score on test set : 0.503118503118503
make_confusion_matrix(ab_tuned,y_test)
# importance of features in the tree building
print(pd.DataFrame(ab_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                 Imp
MonthlyIncome               0.458006
Age                         0.148775
NumberOfTrips               0.058559
PreferredPropertyStar       0.039890
Gender_Male                 0.032700
CityTier_3                  0.026406
Passport_1                  0.024563
TypeofContact_Self Enquiry  0.022094
Occupation_Large Business   0.021684
MaritalStatus_Single        0.019671
Designation_Senior Manager  0.017846
NumberOfPersonVisiting_3    0.015082
Designation_Manager         0.013833
CityTier_2                  0.012884
NumberOfPersonVisiting_2    0.011968
Occupation_Small Business   0.011691
NumberOfChildrenVisiting    0.011246
MaritalStatus_Married       0.010745
Designation_Executive       0.010637
MaritalStatus_Unmarried     0.008271
OwnCar_1                    0.006643
NumberOfPersonVisiting_4    0.005714
Occupation_Salaried         0.005693
Designation_VP              0.005400
NumberOfPersonVisiting_5    0.000000
feature_names = X_train.columns
importances = ab_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
gb_estimator=GradientBoostingClassifier(random_state=1)
gb_estimator.fit(X_train,y_train)
GradientBoostingClassifier(random_state=1)
gb_classsifier_score=get_metrics_score(gb_estimator)
Accuracy on training set : 0.8807850029291154
Accuracy on test set : 0.8620218579234973
Recall on training set : 0.4485981308411215
Recall on test set : 0.37318840579710144
Precision on training set : 0.844574780058651
Precision on test set : 0.7803030303030303
F1 Score on training set : 0.5859613428280773
F1 Score on test set : 0.5049019607843137
make_confusion_matrix(gb_estimator,y_test)
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(pd.DataFrame(gb_estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                 Imp
MonthlyIncome               0.186460
Passport_1                  0.167246
Designation_Executive       0.153909
Age                         0.128787
CityTier_3                  0.065791
MaritalStatus_Single        0.051824
PreferredPropertyStar       0.050960
NumberOfTrips               0.040990
Designation_Senior Manager  0.022816
MaritalStatus_Unmarried     0.020969
MaritalStatus_Married       0.019103
Occupation_Large Business   0.017114
TypeofContact_Self Enquiry  0.014515
Designation_Manager         0.012870
Gender_Male                 0.009807
Occupation_Small Business   0.008556
CityTier_2                  0.008083
NumberOfChildrenVisiting    0.006847
NumberOfPersonVisiting_4    0.004305
Occupation_Salaried         0.003647
NumberOfPersonVisiting_3    0.002907
OwnCar_1                    0.001103
NumberOfPersonVisiting_2    0.001070
Designation_VP              0.000321
NumberOfPersonVisiting_5    0.000000
# Choose the type of classifier.
gb_tuned = GradientBoostingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'n_estimators': np.arange(50,200,25),
'subsample':[0.7,0.8,0.9,1],
'max_features':[0.7,0.8,0.9,1],
'max_depth':[3,5,7,10]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(gb_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gb_tuned.fit(X_train, y_train)
GradientBoostingClassifier(max_depth=7, max_features=0.8, n_estimators=175,
                           random_state=1, subsample=0.9)
# importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; also known as the Gini importance)
print(pd.DataFrame(gb_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                 Imp
MonthlyIncome               0.282319
Age                         0.160916
Passport_1                  0.079050
NumberOfTrips               0.072592
Designation_Executive       0.064096
PreferredPropertyStar       0.049152
CityTier_3                  0.038202
Gender_Male                 0.026933
MaritalStatus_Single        0.025179
TypeofContact_Self Enquiry  0.024940
NumberOfChildrenVisiting    0.020667
MaritalStatus_Unmarried     0.019320
Occupation_Large Business   0.017635
Designation_Manager         0.015362
Occupation_Small Business   0.014726
MaritalStatus_Married       0.013858
OwnCar_1                    0.012689
NumberOfPersonVisiting_2    0.012624
Designation_Senior Manager  0.012553
Occupation_Salaried         0.011307
NumberOfPersonVisiting_4    0.008404
NumberOfPersonVisiting_3    0.007115
CityTier_2                  0.006614
Designation_VP              0.003708
NumberOfPersonVisiting_5    0.000036
feature_names = X_train.columns
importances = gb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
gbm_tuned_classifier_score=get_metrics_score(gb_tuned)
Accuracy on training set : 1.0
Accuracy on test set : 0.907103825136612
Recall on training set : 1.0
Recall on test set : 0.5978260869565217
Precision on training set : 1.0
Precision on test set : 0.868421052631579
F1 Score on training set : 1.0
F1 Score on test set : 0.7081545064377682
make_confusion_matrix(gb_tuned,y_test)
xgb_estimator=XGBClassifier(random_state=1, eval_metric='logloss')
xgb_estimator.fit(X_train,y_train)
XGBClassifier(eval_metric='logloss', n_estimators=100, random_state=1, ...)
XGBClassifier_score=get_metrics_score(xgb_estimator)
Accuracy on training set : 0.9956063268892794
Accuracy on test set : 0.8941256830601093
Recall on training set : 0.9766355140186916
Recall on test set : 0.5615942028985508
Precision on training set : 1.0
Precision on test set : 0.8201058201058201
F1 Score on training set : 0.988179669030733
F1 Score on test set : 0.6666666666666667
make_confusion_matrix(xgb_estimator,y_test)
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1,eval_metric='logloss')
# Grid of parameters to choose from
parameters = {'n_estimators': [75,100,125,150],
'subsample':[0.7, 0.8, 0.9, 1],
'gamma':[0, 1, 3, 5],
'colsample_bytree':[0.7, 0.8, 0.9, 1],
'colsample_bylevel':[0.7, 0.8, 0.9, 1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(colsample_bylevel=1, colsample_bytree=1, eval_metric='logloss',
              gamma=0, n_estimators=150, random_state=1, ...)
feature_names = X_train.columns
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
xgb_tuned_classifier_score=get_metrics_score(xgb_tuned)
Accuracy on training set : 0.9991212653778558
Accuracy on test set : 0.8961748633879781
Recall on training set : 0.9953271028037384
Recall on test set : 0.5797101449275363
Precision on training set : 1.0
Precision on test set : 0.8163265306122449
F1 Score on training set : 0.9976580796252927
F1 Score on test set : 0.6779661016949153
make_confusion_matrix(xgb_tuned,y_test)
Now, let's build stacking models. The first stacks the decision tree and bagging classifier as base estimators and uses a random forest to make the final prediction; the second stacks the AdaBoost and gradient boosting classifiers and uses XGBoost as the final estimator.
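Under the hood, `StackingClassifier` fits each base estimator on cross-validation folds and trains the final estimator on their out-of-fold predictions, so the meta-learner never sees predictions made on a base model's own training data. A minimal sketch on synthetic data (not the tourism dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X_s, y_s = make_classification(n_samples=300, random_state=1)

# cross_val_predict mirrors what StackingClassifier does internally:
# out-of-fold probabilities become the meta-features for the final estimator
oof = cross_val_predict(DecisionTreeClassifier(random_state=1),
                        X_s, y_s, cv=5, method='predict_proba')
print(oof.shape)  # → (300, 2): one row of meta-features per sample

stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(random_state=1))],
    final_estimator=RandomForestClassifier(random_state=1), cv=5)
stack.fit(X_s, y_s)
print(stack.score(X_s, y_s))
```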
estimators=[('Decision Tree', dtree),('Bagging Classifier', bagging_classifier)]
final_estimator=RandomForestClassifier(random_state=1)
stacking_estimator=StackingClassifier(estimators=estimators, final_estimator=final_estimator,cv=5)
stacking_estimator.fit(X_train,y_train)
StackingClassifier(cv=5,
                   estimators=[('Decision Tree',
                                DecisionTreeClassifier(random_state=1)),
                               ('Bagging Classifier',
                                BaggingClassifier(random_state=1))],
                   final_estimator=RandomForestClassifier(random_state=1))
stacking_classifier_score=get_metrics_score(stacking_estimator)
Accuracy on training set : 0.9988283538371412
Accuracy on test set : 0.8920765027322405
Recall on training set : 0.9968847352024922
Recall on test set : 0.6413043478260869
Precision on training set : 0.9968847352024922
Precision on test set : 0.75
F1 Score on training set : 0.9968847352024922
F1 Score on test set : 0.69140625
make_confusion_matrix(stacking_estimator,y_test)
estimators=[('AB Classifier', ab_Classifier),('Gradient Boosting Classifier', gb_estimator)]
final_estimator=XGBClassifier(random_state=1)
stacking_estimator_boosting=StackingClassifier(estimators=estimators, final_estimator=final_estimator,cv=5)
stacking_estimator_boosting.fit(X_train,y_train)
StackingClassifier(cv=5,
                   estimators=[('AB Classifier',
                                AdaBoostClassifier(random_state=1)),
                               ('Gradient Boosting Classifier',
                                GradientBoostingClassifier(random_state=1))],
                   final_estimator=XGBClassifier(n_estimators=100,
                                                 random_state=1, ...))
stacking_classifier_score1=get_metrics_score(stacking_estimator_boosting)
Accuracy on training set : 0.8646748681898067
Accuracy on test set : 0.8490437158469946
Recall on training set : 0.48286604361370716
Recall on test set : 0.427536231884058
Precision on training set : 0.7045454545454546
Precision on test set : 0.6519337016574586
F1 Score on training set : 0.5730129390018485
F1 Score on test set : 0.5164113785557987
# defining list of models
models = [dtree, dtree_tuned, rf_estimator, rf_tuned, bagging_classifier, bgc_dt,bgcht, ab_Classifier, ab_tuned, gb_estimator,gb_tuned, xgb_estimator, xgb_tuned,stacking_estimator,stacking_estimator_boosting]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_scorefinal_train = []
f1_scorefinal_test = []
# looping through all the models to get the accuracy, recall, precision, and F1 scores
for model in models:
j = get_metrics_score(model,False)
acc_train.append(np.round(j[0],2))
acc_test.append(np.round(j[1],2))
recall_train.append(np.round(j[2],2))
recall_test.append(np.round(j[3],2))
precision_train.append(np.round(j[4],2))
precision_test.append(np.round(j[5],2))
f1_scorefinal_train.append(np.round(j[6],2))
f1_scorefinal_test.append(np.round(j[7],2))
comparison_frame = pd.DataFrame({'Model':['Decision Tree', 'Decision Tree with HyperParameterTuning', 'Random Forest', 'Random Forest-HyperParameterTuning', 'Bagging Classifier', 'Bagging Classifier-DT','Bagging Classifier-Tuned', 'AdaBoost with default parameters','AdaBoost Tuned',
'Gradient Boosting with default parameters',
'Gradient Boosting Tuned','XGBoost with default parameters','XGBoost Tuned','Stacking with Bagging Algo','Stacking with Boosting Algo'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_F1Score': f1_scorefinal_train,'Test_F1Score':f1_scorefinal_test})
comparison_frame
|   | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1Score | Test_F1Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.00 | 0.86 | 1.00 | 0.64 | 1.00 | 0.62 | 1.00 | 0.63 |
| 1 | Decision Tree with HyperParameterTuning | 1.00 | 0.86 | 1.00 | 0.62 | 0.99 | 0.64 | 0.99 | 0.63 |
| 2 | Random Forest | 1.00 | 0.89 | 1.00 | 0.48 | 1.00 | 0.88 | 1.00 | 0.62 |
| 3 | Random Forest-HyperParameterTuning | 0.87 | 0.83 | 0.70 | 0.61 | 0.66 | 0.54 | 0.68 | 0.57 |
| 4 | Bagging Classifier | 0.99 | 0.89 | 0.94 | 0.51 | 1.00 | 0.81 | 0.97 | 0.63 |
| 5 | Bagging Classifier-DT | 0.99 | 0.87 | 0.94 | 0.45 | 1.00 | 0.79 | 0.97 | 0.57 |
| 6 | Bagging Classifier-Tuned | 1.00 | 0.89 | 1.00 | 0.51 | 1.00 | 0.87 | 1.00 | 0.64 |
| 7 | AdaBoost with default parameters | 0.84 | 0.84 | 0.29 | 0.29 | 0.69 | 0.71 | 0.41 | 0.41 |
| 8 | AdaBoost Tuned | 0.97 | 0.84 | 0.86 | 0.44 | 0.95 | 0.59 | 0.90 | 0.50 |
| 9 | Gradient Boosting with default parameters | 0.88 | 0.86 | 0.45 | 0.37 | 0.84 | 0.78 | 0.59 | 0.50 |
| 10 | Gradient Boosting Tuned | 1.00 | 0.91 | 1.00 | 0.60 | 1.00 | 0.87 | 1.00 | 0.71 |
| 11 | XGBoost with default parameters | 1.00 | 0.89 | 0.98 | 0.56 | 1.00 | 0.82 | 0.99 | 0.67 |
| 12 | XGBoost Tuned | 1.00 | 0.90 | 1.00 | 0.58 | 1.00 | 0.82 | 1.00 | 0.68 |
| 13 | Stacking with Bagging Algo | 1.00 | 0.89 | 1.00 | 0.64 | 1.00 | 0.75 | 1.00 | 0.69 |
| 14 | Stacking with Boosting Algo | 0.86 | 0.85 | 0.48 | 0.43 | 0.70 | 0.65 | 0.57 | 0.52 |
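Given that F1 on the test set is the chosen criterion, sorting makes the ranking explicit. A small illustration using three rows transcribed from the table above:

```python
import pandas as pd

# Three rows transcribed from the comparison table above
demo = pd.DataFrame({'Model': ['Gradient Boosting Tuned', 'XGBoost Tuned',
                               'Stacking with Bagging Algo'],
                     'Test_F1Score': [0.71, 0.68, 0.69]})
ranked = demo.sort_values(by='Test_F1Score', ascending=False)
print(ranked.iloc[0]['Model'])  # → Gradient Boosting Tuned
```

On the full frame, the same `sort_values` call over `comparison_frame` ranks all fifteen models; Gradient Boosting Tuned leads on test F1 (0.71).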
------------ THE END ---------------